IDENTIFICATION AND MAPPING OF SINGLE NUCLEOTIDE
POLYMORPHISMS IN THE HUMAN GENOME 

(HALE AND DORR NO. 108827, 129) 
BACKGROUND OF THE INVENTION
Field of the invention

The invention relates to the role of genes in human diseases. More 
particularly, the invention relates to compositions and methods for identifying 
genes that are involved in human disease conditions. 

Summary of the related art

During the past two decades, remarkable developments in molecular biology 
and genetics have produced a revolutionary growth in understanding of the 
implication of genes in human disease. Genes have been shown to be directly
causative of certain disease states. For example, it has long been known that sickle
cell anemia is caused by a single mutation in the human beta globin gene. In many
other cases, genes play a role together with environmental factors and/or other
genes to either cause disease or increase susceptibility to disease. Prominent
examples of such conditions include the role of DNA sequence variation in ApoE in
Alzheimer's disease; CKR5 in susceptibility to infection by HIV; Factor V in risk of
deep venous thrombosis; MTHFR in cardiovascular disease and neural tube defects;
p53 in HPV infection; various cytochrome p450s in drug metabolism; and HLA in
autoimmune disease.

Surprisingly, the genetic variations that lead to gene involvement in human 
disease are relatively small. Approximately 1% of the DNA bases which comprise 
the human genome contain polymorphisms that vary at least 1% of the time in the 
human population. The genomes of all organisms, including humans, undergo 
spontaneous mutation in the course of their continuing evolution. The majority of 
such mutations create polymorphisms, thus the mutated sequence and the initial
sequence coexist in the population. Most of these differences are functionally
inconsequential in that they neither affect the amino acid sequence of encoded
proteins nor the expression levels of the encoded proteins. Some polymorphisms
that lie within genes or their promoters do have a phenotypic effect, and it is
this small proportion of the genome's variation that accounts for the genetic
component of all difference between individuals, e.g., physical appearance,
disease susceptibility, disease resistance, and responsiveness to drug
treatments.

The relation between human genetic variability and human phenotype is a
central theme in modern human genetic studies. The human genome comprises
approximately 4 billion bases of DNA. The Human Genome Project is uncovering
more and more of the consensus sequence of this genome. However, there
remains a need to identify the nature and location of genetic variations that are 
implicated in human disease conditions. 

Sequence variation in the human genome consists primarily of single
nucleotide polymorphisms ("SNPs") with the remainder of the sequence variations
being short tandem repeats (including microsatellites), long tandem repeats
(minisatellites) and other insertions and deletions. A SNP is a position at which two
alternative bases occur at appreciable frequency (i.e. >1%) in the human population.
A SNP is said to be "allelic" in that due to the existence of the polymorphism some
members of a species may have the unmutated sequence (i.e., the original allele)
whereas other members may have a mutated sequence (i.e., the variant or mutant
allele). In the simplest case, only one mutated sequence may exist, and the
polymorphism is said to be diallelic. The occurrence of alternative mutations can
give rise to triallelic polymorphisms, etc. SNPs are widespread throughout the
genome and SNPs that alter the function of a gene may be direct contributors to
phenotypic variation. Due to their prevalence and widespread nature, SNPs have
potential to be important tools for locating genes that are involved in human
disease conditions. Wang et al., Science 280: 1077-1082 (1998), discloses a pilot study
in which 2,227 SNPs were mapped over a 2.3 megabase region of DNA.

To be useful for locating and identifying genetic variations linked to human 
disease, however, it is necessary to identify and map a much larger number of SNPs, 
and to do so throughout the human genome. There is therefore a need for the 
identification and mapping of a very large number of SNPs throughout the entire 
human genome. 

BRIEF SUMMARY OF THE INVENTION
The invention provides identification and mapping of a very large number of 
SNPs throughout the entire human genome. 

In a first aspect, the invention provides SNP probes which are useful in 
classifying people according to their genetic variation. The SNP probes according to 
the invention are oligonucleotides which can discriminate between alleles of a SNP 
nucleic acid in conventional allelic discrimination assays. 

In a second aspect, the invention provides methods for using a large-scale
map of SNPs throughout the human genome to isolate and identify genes that are
relevant to the prevention, causation, or treatment of human disease conditions.
Preferred embodiments of this aspect of the invention include linkage studies in
families, linkage disequilibrium in isolated populations, association analysis of
patients and controls and loss-of-heterozygosity studies in tumors.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 depicts the number of human restriction fragments with sizes in a 
200 bp range centered on a given point for a typical six-cutter restriction enzyme. 



DESCRIPTION OF THE SEQUENCE LISTING

A sequence listing is being provided with this provisional application on the 
accompanying Jaz disk. For each SEQ ID NO. is shown the polymorphism within 
the consensus sequence, the position of the polymorphism in the consensus 
sequence along with the identity of the polymorphism and frequency of the alleles,
and the map location of the identified sequence. For example, for a polymorphism
in which "a" is identified 4 times and "t" is identified 2 times within a consensus
sequence at position 35 from the 5' end, the text identifying the sequence will read
"SEQ ID NO. ###; polymorphism=w; position=35; alleles=a(4)t(2)." In some cases,
the polymorphism consists of a single base deletion. In this case, the deleted base is
indicated as a hyphen (-). The map location of the listed sequence is described by
each of the various means which were used to identify the location, including the
following:

1) base location relative to a GenBank hit is listed as "sequence=Acc/Off",
where "Acc" is the accession number of the matching GenBank entry and "Off" is
the offset of the polymorphism from the start of the GenBank entry, for example,
"sequence=M39218/98112" indicates that the polymorphism is 98,112 base pairs
offset from the start of GenBank entry M39218.

2) chromosome number is listed as chromosome=N, where N is the
chromosome number, for example "chromosome=12".

3) cytogenetic position is listed as cytogenetic=I, where I is the cytogenetic
position, for example "cytogenetic=1q12.3".

4) radiation hybrid ("rh") position relative to a GenBank entry is listed as
rh=Acc/Offset (P), where "Acc" is the accession number of the relative GenBank
entry, "Offset" is the centiray distance from the relative GenBank entry, and "(P)" is
the radiation hybrid panel used. For example "rh=M39128/21.2 (TNG)" indicates
that the sequence is located 21.2 centiray from GenBank entry M39128 using the
TNG radiation hybrid panel. Multiple map coordinates may be provided for any
SEQ ID NO. and each coordinate is separated by a space, for example
"map location=[chromosome=12 rh=M39128/21.2(TNG) cytogenetic=12q18.1]."
When the map position is unknown, the map fields are blank.
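
The annotation format described above can be read mechanically. The following is a minimal sketch in Python, assuming the fields appear exactly as in the examples given here; the function names `parse_polymorphism` and `parse_map_location` are hypothetical and are not part of the sequence listing or of any tool named in this application.

```python
import re

# Matches key=value tokens. A value may be followed by "(PANEL)" with or
# without a separating space, as in "rh=M39128/21.2 (TNG)".
_FIELD = re.compile(r"(\w+)=([^\s\[\]]+(?:\s*\([^)]*\))?)")


def parse_polymorphism(text):
    """Parse e.g. 'SEQ ID NO. ###; polymorphism=w; position=35; alleles=a(4)t(2).'"""
    fields = {}
    for part in text.strip().rstrip(".").split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    # Each allele is a base (or "-" for a deleted base) followed by its count.
    counts = {
        base: int(n)
        for base, n in re.findall(r"([acgt-])\((\d+)\)", fields.get("alleles", ""))
    }
    return {
        "code": fields.get("polymorphism"),
        "position": int(fields["position"]),
        "alleles": counts,
    }


def parse_map_location(text):
    """Parse the bracketed map coordinates; unknown positions give an empty dict."""
    fields = {}
    for key, value in _FIELD.findall(text):
        if key == "sequence":
            acc, offset = value.split("/")
            fields[key] = (acc.strip(), int(offset))
        elif key == "rh":
            m = re.fullmatch(r"([^/]+)/([\d.]+)\s*\((\w+)\)", value)
            if m is None:
                raise ValueError(f"unrecognised rh coordinate: {value!r}")
            fields[key] = (m.group(1), float(m.group(2)), m.group(3))
        else:
            fields[key] = value
    return fields
```

For instance, `parse_map_location("map location=[chromosome=12 rh=M39128/21.2(TNG)]")` yields the chromosome as a string and the radiation hybrid coordinate as an (accession, centiray offset, panel) tuple.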

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention relates to the role of genes in human diseases. More 
particularly, the invention relates to compositions and methods for identifying 
genes that are involved in human disease conditions. Any patents and publications 
cited herein reflect the knowledge in this field and are hereby incorporated by 
reference in entirety. Any conflict between any reference cited herein and the 
specific teachings of this specification shall be resolved in favor of the latter. 

The invention provides identification and mapping of a very large number of 
SNPs throughout the entire human genome. This contribution allows scientists to 
isolate and identify genes that are relevant to the prevention, causation, or 
treatment of human disease conditions. 

In a first aspect, the invention provides SNP probes which are useful in 
classifying people according to their genetic variation. The SNP probes according to 
the invention are oligonucleotides which can discriminate between alleles of a SNP 
nucleic acid in conventional allelic discrimination assays. As used herein, a "SNP 
nucleic acid" is a nucleic acid sequence which comprises a nucleotide which is 
variable within an otherwise identical nucleotide sequence between individuals or 
groups of individuals, thus existing as alleles. Such SNP nucleic acids are preferably 
from about 15 to about 500 nucleotides in length. The SNP nucleic acids may be part 
of a chromosome, or they may be an exact copy of a part of a chromosome, e.g., by 
amplification of such a part of a chromosome through PCR or through cloning.

The SNP probes according to the invention are oligonucleotides that are 
complementary to a SNP nucleic acid. The term "complementary" means exactly 
complementary throughout the length of the oligonucleotide in the Watson and 
Crick sense of the word. In certain preferred embodiments, the oligonucleotides 
according to this aspect of the invention are complementary to one allele of the SNP 
nucleic acid, but not to any other allele of the SNP nucleic acid. Oligonucleotides
according to this embodiment of the invention can discriminate between alleles of
the SNP nucleic acid in various ways. For example, under stringent hybridization
conditions, an oligonucleotide of appropriate length will hybridize to one allele of
the SNP nucleic acid, but not to any other allele of the SNP nucleic acid. (See e.g.,
Saiki et al., Proc. Natl. Acad. Sci. USA 86: 6230-6234 (1989)). For this application,
preferred oligonucleotide lengths are from about 15 nucleotides to about 25
nucleotides. Preferred final hybridization conditions for this application are 2x PBS
at room temperature. Preferably, the oligonucleotide is labeled, most preferably by a
radiolabel, an enzymatic label, or a fluorescent label. Alternatively, an
oligonucleotide of appropriate length can be used as a primer for PCR, wherein the
3' terminal nucleotide is complementary to one allele of the SNP nucleic acid, but
not to any other allele. In this embodiment, the presence or absence of
amplification by PCR determines the haplotype of the SNP nucleic acid.
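
The hybridization probes described above, each exactly complementary to one allele, can be illustrated with a short Python sketch. It assumes the SNP is placed at the center of the probe and uses a default length of 19 nucleotides, an arbitrary value within the preferred 15 to 25 nucleotide range; both choices, and the function names, are illustrative assumptions rather than requirements of the invention.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Watson-Crick reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def allele_specific_probes(sequence, snp_index, alleles, length=19):
    """Return {allele: probe}, each probe exactly complementary to one allele.

    `snp_index` is the 0-based index of the variable base in `sequence`
    (the sequence listing counts positions from the 5' end instead).
    The SNP is centered in the probe, which is an illustrative choice.
    """
    if not 15 <= length <= 25:
        raise ValueError("length outside the preferred 15-25 nucleotide range")
    start = snp_index - length // 2
    end = start + length
    if start < 0 or end > len(sequence):
        raise ValueError("SNP is too close to the end of the sequence")
    probes = {}
    for allele in alleles:
        # Target strand carrying this allele at the variable position.
        target = sequence[start:snp_index] + allele + sequence[snp_index + 1:end]
        probes[allele] = reverse_complement(target)
    return probes
```

Two probes produced this way for a diallelic SNP differ at a single central base, which is the position that stringent hybridization conditions must discriminate.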

To identify the SNP nucleic acids (sometimes referred to hereafter simply as
"SNPs") present in the human genome, a whole genome approach was taken to
identify SNPs on a large scale. The method described in the following examples,
termed the "Reduced-Representation Shotgun" or "RRS", was utilized as it allows
the random sequencing of a specific subset (e.g., 1%) of the genome from a collection
of individuals.

Our intent was to sequence each fraction of the genomic DNA to a depth of 
2.5-5X coverage. This level of coverage was determined through a calculation of 
Poisson sampling for different levels of SNP allele frequency. Briefly, the 
proportion of SNPs identified increases with the depth of coverage of the
sequencing (the sequencing of a fragment from one individual provides 1x of
coverage and the sequencing of the same fragment from each additional individual
provides an additional 1x of coverage), and more common SNPs are more rapidly
detected than less common SNPs. The efficiency of detection, or number of SNPs
detected per additional 1x depth of coverage, however, peaks at about 2.5x coverage
and diminishes significantly when greater than 5x coverage is obtained (calculation 
not shown). 
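
The qualitative behaviour described above can be sketched with a simple Poisson model in Python. This is an illustration under stated assumptions, not the calculation referred to in the text (which is not shown): reads covering a site are taken to be Poisson-distributed with mean equal to the depth of coverage, each read independently carries the minor allele with probability equal to its population frequency, and a SNP counts as detected once each allele has been read at least once. The function names are hypothetical.

```python
import math


def detection_probability(coverage, minor_allele_freq):
    """Probability that both alleles of a SNP are sampled at least once.

    Under the Poisson assumption the read counts for the two alleles are
    independent Poisson variables with means coverage*p and coverage*q,
    so the two "seen at least once" probabilities multiply.
    """
    p = minor_allele_freq
    q = 1.0 - p
    return (1.0 - math.exp(-coverage * p)) * (1.0 - math.exp(-coverage * q))


def marginal_gain(coverage, minor_allele_freq, step=1.0):
    """Extra detection probability gained by adding `step` more coverage."""
    return (detection_probability(coverage + step, minor_allele_freq)
            - detection_probability(coverage, minor_allele_freq))


if __name__ == "__main__":
    for freq in (0.5, 0.2, 0.05):
        row = ", ".join(
            f"{c}x: {detection_probability(c, freq):.2f}" for c in (1, 2.5, 5, 10)
        )
        print(f"minor allele frequency {freq}: {row}")
```

In this model common SNPs are found at lower depth than rare ones, and the gain from each additional 1x of coverage eventually shrinks; the exact depth at which the gain peaks depends on the allele-frequency distribution assumed, so the sketch does not attempt to reproduce the 2.5x figure.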

The distribution of restriction sites tends to be uniform across the human 
genome (with the exception of restriction sites containing the CpG dinucleotide). 
Thus, the proportion of the genome present in any size fraction can be varied by the 
size and extent of the fraction taken. For example, in a survey of available genomic 
sequence data on chromosomes 22 and X, the frequency and distribution of 
restriction fragments was examined, see Table 1. 

Table 1. Distribution of Restriction Fragments in Genomic Sequence.

Enzyme             EcoRI    EcoRV    BamHI    HindIII    HindIII
Chromosome         22       22       22       22         X
Size Range (kb)
1-2                40.9     13.7     29.7     44.6       67.6
2-3                33       12.6     24.8     32.7       46.6
3-4                27       9.4      18.5     26.2       34.5
4-5                17.3     9.5      15       20.9       23.8
5-7                28.3     15       22.1     25.8       29.3
7-9                16.2     8.7      15.4     16         15.6
9-11               10       9.1      11.9     8.5        8.6

(Values are given as number of fragments per Mb, calculated from analysis of 14 Mb
or 22 Mb of genomic sequence on chromosomes 22 or X, respectively)

Chromosome-specific variation of restriction site distribution is illustrated by
a comparison of the HindIII analysis for chromosomes 22 and X. For this reason,
RRS plasmid libraries made using different restriction enzymes are quite useful.
The results of restriction fragment distribution shown in Table 1 above indicate that
for the approximately 50 Mb of chromosome 22, about 850 distinct fragments will
theoretically be present in a 2-2.5 kb fraction of HindIII or EcoRI fragments, and a 5x
coverage of the sequence of both ends of these fragments requires approximately
11,000 reads. In practice about 25% more reads were taken as each fraction contains
some spillover of fragments from adjacent size fractions.

The number of restriction sites in the entire human genome for a typical
six-cutter restriction enzyme can be calculated and plotted as shown in Figure 1. As
shown in Figure 1, there are roughly 33,000 fragments in the range of 400-600 bp, and
about 22,000 fragments in the range 1.9-2.1 kb. Each 400-600 bp fragment could be
sequenced in a single sequencing reaction, and each 1.9-2.1 kb fragment could be
sequenced in two sequencing reactions, one from each end. Thus it is apparent that
approximately 33,000 reads of fragments in the range 400-600 bp, or 44,000 sequencing
reads of fragments in the range 1.9-2.1 kb, would each provide 1x coverage of the SNPs
present in the selected fraction of the human genome.
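
A calculation of this kind can be sketched in Python as follows. The sketch is not the calculation behind Figure 1; it assumes independent, equally frequent bases (so a given six-base site starts at any position with probability 4^-6 and fragment lengths are geometric) and an illustrative genome size of 3 x 10^9 bases. The function names and the genome size are assumptions introduced here.

```python
def expected_fragments(genome_size, min_len, max_len, site_length=6):
    """Expected number of restriction fragments with min_len <= length < max_len.

    Assumes a random sequence with equal base frequencies, so a recognition
    site of `site_length` bases begins at any position with probability
    p = 4**-site_length and fragment lengths follow a geometric distribution.
    """
    p = 4.0 ** -site_length
    total_fragments = genome_size * p
    # For a geometric length distribution, P(length >= L) = (1 - p)**L.
    fraction_in_range = (1.0 - p) ** min_len - (1.0 - p) ** max_len
    return total_fragments * fraction_in_range


def reads_for_coverage(n_fragments, reads_per_fragment, depth=1.0):
    """Sequencing reads needed to cover every fragment to the given depth."""
    return n_fragments * reads_per_fragment * depth


GENOME_SIZE = 3.0e9  # assumed genome size in bases (illustrative only)

if __name__ == "__main__":
    short = expected_fragments(GENOME_SIZE, 400, 600)
    long_ = expected_fragments(GENOME_SIZE, 1900, 2100)
    print(f"400-600 bp fragments: {short:,.0f}")
    print(f"1.9-2.1 kb fragments: {long_:,.0f}")
    print(f"reads for 1x of the 1.9-2.1 kb fraction: {reads_for_coverage(long_, 2):,.0f}")
```

Under these assumptions the two fractions come out in the low 30,000s and low 20,000s of fragments respectively, the same order as the values read from Figure 1. Real genomes depart from the random-sequence model, as the chromosome-specific differences in Table 1 show.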

The oligonucleotides according to this aspect of the invention are useful for 
identifying people according to their haplotype for a panel of SNP nucleic acids. 
This can be achieved by obtaining a nucleic acid sample from an individual and
using the oligonucleotides according to the invention to assay for which allele the
individual has for a particular set of SNP nucleic acids disclosed herein, as discussed
above. If a sufficiently large number of SNP nucleic acids are assayed, a unique
haplotype can be established as a reference for that individual. Subsequently, if a
biological sample which may be from that individual needs to be identified, for
forensic purposes, the oligonucleotides according to the invention can be used in
identical assays on the biological sample, and the results can be compared to the 
reference haplotype to determine whether the biological sample is from the same 
individual. The oligonucleotides according to the invention are also useful in 
studies to determine the relevance of various genes to the prevention, causation or 
treatment of various human disease conditions, as further discussed below. 

Thus, in a second aspect, the invention provides methods for using a
large-scale map of SNPs throughout the human genome to isolate and identify genes that
are relevant to the prevention, causation, or treatment of human disease
conditions. Preferred embodiments of this aspect of the invention include linkage
studies in families, linkage disequilibrium in isolated populations, association
analysis of patients and controls and loss-of-heterozygosity studies in tumors.

The SNP map and its methods of use according to this aspect of the invention 
transform the search for susceptibility genes through the use of association studies 
and through the use of linkage disequilibrium studies. Linkage disequilibrium
studies are indirect studies in which an investigator seeks to identify the presence of
common ancestral chromosomes among susceptible individuals. Association
studies are direct studies in which an investigator tests whether a genetic variant
increases disease risk by comparing allele frequencies in affected individuals and
controls. Association studies make possible the identification of genes with relatively
common variants that confer a modest or small effect on disease risk, which is
precisely the type of gene expected in the most complex disorders. Association
studies are logistically simpler to organize and are potentially more powerful than
family-based linkage studies, but they have previously had the practical limitation
that one can only test a few guesses rather than being able to systematically scan the
entire genome. In the method according to the invention, association studies can be
extended to include a systematic search through the entire list of common variants
in the human genome to reveal the identity of the gene or genes underlying any 
phenotype not due to a rare allele. The SNP map of the human genome provided 
by the invention will make it possible to test disease susceptibility against every 
common variant simultaneously, for example, by genotyping a well-characterized 
clinical population with a comprehensive DNA array. 
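
The comparison of allele frequencies between patients and controls at a single SNP can be illustrated with a standard Pearson chi-square test on a 2x2 table of allele counts. This Python sketch is one conventional way to make that comparison and is offered as an assumption; the specification does not prescribe a particular statistic, and the function name is hypothetical. A genome-wide scan would repeat the test for every SNP and would need a correction for multiple testing, which is not shown.

```python
import math


def allele_association_test(case_variant, case_other, control_variant, control_other):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Counts are numbers of chromosomes carrying the
    variant allele or the other allele in patients (cases) and controls.
    """
    a, b, c, d = case_variant, case_other, control_variant, control_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, 30 variant alleles among 100 patient chromosomes against 15 among 100 control chromosomes gives a chi-square statistic of about 6.45.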

The SNP map used in this aspect of the invention can be prepared using a 
variety of methods. One traditional method of mapping the locus of a SNP is to 
create a PCR assay to amplify the locus and then to perform genetic mapping or
whole-genome radiation hybrid ("RH") mapping. Another method for mapping
the locus of a SNP is "in silico mapping" in which the SNP and its flanking
sequence is "BLASTed" against the publicly available sequence, such as the sequence
managed by NCBI or GenBank, in order to identify the genomic overlaps that will
positionally map the SNPs. We utilized both RH mapping and in silico mapping to
map the locus of the SNPs. 

The location of the identified SNPs was mapped by RH mapping onto the
existing Stanford TNG panel through developing each SNP as an STS. The TNG
panel was chosen for mapping as it has been shown to order new STSs with greater
than 95% confidence at 100 kb resolution. The Stanford TNG panel consists of 90
independent hybrids with an average human marker retention per hybrid of 19%.
This panel was constructed with 50,000 rad of irradiation, resulting in human
chromosomal fragments of 300 kb average size. The practical resolution of the TNG
panel is 21 kb. One can think of the TNG panel as a "clone library", representing a
17-fold redundancy of the human genome, with a human insert size of 300 kb and
333,000 detectable ends.

This map can be used for conventional linkage studies in families, linkage 
disequilibrium studies in isolated populations, association analysis of patients and
controls and loss-of-heterozygosity studies in tumors. For example, the linkage
disequilibrium method of Hastbacka et al., Nature Genetics 2: 204-211 (1992), can be
used, substituting SNPs according to the invention for the RFLPs used in that
report. Briefly, linkage disequilibrium mapping is based on the observation that
chromosomes having a gene associated with disease which are descended from a
common ancestral mutation should show a distinctive haplotype in the immediate
vicinity of the gene, reflecting the haplotype of the ancestral chromosome. For
example, the method is particularly useful when there is a single disease-causing
allele with a high frequency, so that the excess of an ancestral haplotype can be
detected easily, and when the allele was introduced into the population sufficiently
long ago that recombination has made the region of strongest linkage relatively
small. Population genetics are then used to determine how much recombination 
should be expected between the gene and one or more nearby SNPs of known map 
location, thus locating the gene with respect to the SNP map. 
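
The relationship between the excess of an ancestral haplotype and the expected amount of recombination can be sketched with a deliberately simplified decay model in Python. The sketch assumes that the ancestral association decays by a factor (1 - theta) per generation, where theta is the recombination fraction between the gene and the SNP; it is not the full analysis of the cited report, and the function names and the choice of model are assumptions made for illustration.

```python
def excess_ancestral_haplotype(freq_on_disease, freq_on_normal):
    """Excess of the ancestral SNP allele on disease chromosomes.

    freq_on_disease: frequency of the allele on disease-bearing chromosomes.
    freq_on_normal:  frequency of the same allele on normal chromosomes.
    """
    if not 0.0 <= freq_on_normal < 1.0:
        raise ValueError("freq_on_normal must be in [0, 1)")
    return (freq_on_disease - freq_on_normal) / (1.0 - freq_on_normal)


def recombination_fraction(p_excess, generations):
    """Solve p_excess = (1 - theta) ** generations for theta.

    Simplified model: every disease chromosome descends from one ancestral
    chromosome, and the association decays only through recombination.
    """
    if not 0.0 < p_excess <= 1.0:
        raise ValueError("p_excess must be in (0, 1]")
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 1.0 - p_excess ** (1.0 / generations)
```

A SNP whose ancestral allele is still in complete excess after many generations gives theta = 0 in this model, placing the gene very close to that SNP; lower excess implies a larger recombination fraction and therefore a greater map distance.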

The following examples are intended to further illustrate certain preferred 
embodiments of the invention, and are not intended to be limiting in nature. 

Example 1 

Cloning and identification of SNP nucleic acids

Genomic DNA was isolated from a plurality of unrelated human individuals
and approximately equal amounts from each individual were pooled. The combined
genomic DNA was then cut to completion with one of the following restriction
enzymes: HindIII, EcoRI, EcoRV, and BamHI. Other restriction enzymes are also
useful. The digested genomic DNA was then run on a preparative agarose gel along
with size markers. The agarose gel containing the electrophoresed DNA was cut
into size fractions such that a size range of about 200 base pairs was present in each
slice (e.g., 500-700 base pairs, 1000-1200 base pairs, 2200-2400 base pairs). The DNA
was extracted from the gel. Eluted size fractionated DNA fragments were ligated
into a phosphatased vector which had been cut using the same restriction enzyme as
was used for the digestion of the genomic DNA. Plasmid libraries were prepared by
transforming E. coli with the ligated vectors according to well known methods of
transformation. The plasmid libraries were tested to confirm that they contained a
high proportion of inserts in the selected size fractionation range.

Random colonies of the transformed bacteria were picked for sequencing 
from one or both ends of the genomic DNA insert. Any available method of DNA 
sequencing could be utilized, and dye terminator chemistry was preferred for its
optimum resolution of the heterozygotes. As the genomic DNA libraries were
made from a pool of individuals and the DNA was size fractionated prior to 
preparation of the DNA library, each fragment in the library was sampled multiple 
times, but in almost every case each sequencing read from a given fragment is 
derived from a different DNA sample thus providing a depth of coverage of the 
DNA genomic sequences which otherwise would be unattainable. 

After sequencing of the fragments, the sequences were clustered after masking 
all known repeats. The sequences can be clustered using readily available sequence 
assembly programs, e.g. Phrap. The sequences of each cluster were compared and 
inspected for base differences, and candidate SNPs were identified at positions where 
each base was represented by a Phred quality score of >20. All sequence variants
other than SNPs, an estimated 20-25% of the total, were also noted. All SNPs, and
other variants, which occurred in repetitive sequences were discarded and the 
remainder were entered into a candidate SNP database. 

A subset of the candidate SNPs was verified to confirm that the majority of
the candidate SNPs identified by sequence analysis were informative. The
verification was done using a PCR assay to amplify DNA from several individuals,
plus a few pools of genomic DNA from distinct ethnic groups, and the PCR products
were sequenced using dye terminator chemistry for optimum detection of
heterozygotes. The results, not shown, of the small-scale verification indicated that
the identified SNPs were informative.

In this manner we were able to identify the SNPs contained within the
specific subset of DNA which was sequenced. Through reiterative use of the RRS
method, we were able to identify the majority of the SNPs present in the human
genome. The identified SNPs are listed in Figure 2.

Example 2

Generation of SNP maps

Each SNP was developed into an STS and mapped using the TNG panel by 
using the method of Stewart et al. (1997) Genome Research, vol. 7, pp. 422-433.
Briefly, oligonucleotides for PCR amplification of the fragments containing the
SNPs were chosen using PRIMER 3.0, a software package written at the Whitehead
Genome Center. The oligonucleotide primers were chosen according to parameters
that generate PCR products of 100-400 base pairs in length and that allow the use of a
single set of PCR conditions for all STSs. PCR products are assayed by ethidium
bromide staining following agarose gel electrophoresis. An STS containing an
identified SNP is judged successful when the primers produce a distinct PCR
product of the expected size from total human DNA, but fail to produce a distinct
PCR product of this size from hamster genomic DNA. In addition, each successful
STS is PCR amplified on a set of approximately 90 rodent-human somatic cell
hybrids to assure that the STS maps to a unique human chromosome. Ethidium
stained gel images were captured using a CCD camera system and captured data was
automatically entered into our mapping database. 

The map location for each identified SNP is listed with the SNP sequence in 
Figure 2. 

Example 3 
SNP profiling to identify an individual

Oligonucleotides that recognize one allele of a SNP nucleic acid are 
immobilized on a filter. Preferably, the oligonucleotides comprise oligonucleotides 
complementary to at least 10 different SNP nucleic acids and are present on the filter 
in a pre-arranged array. Each filter with bound oligonucleotides is placed in 4 ml
hybridization solution containing 5x SSPE, 0.5% NaDodSO4 and 400 ng of
streptavidin-horseradish peroxidase conjugate (SeeQuence; Eastman Kodak).
PCR-amplified DNA made with biotinylated primers (20 microliters) from a sample of
blood from an individual is denatured by addition of an equal volume of 400 mM
NaOH/10 mM EDTA and added immediately to the hybridization solution, which is
then incubated at 55°C for 30 minutes. The filters are briefly rinsed twice in 2x SSPE,
0.1% NaDodSO4 at room temperature, washed once in 2x SSPE, 0.5% NaDodSO4 at
55°C and then briefly rinsed twice in 2x PBS (1x PBS is 137 mM NaCl/2.7 mM
KCl/8 mM Na2HPO4/1.5 mM KH2PO4, pH 7.4) at room temperature. Color
development is performed by incubating the filters in 25-50 ml red leuco dye
(Eastman Kodak) at room temperature for 5-10 minutes. The result is
photographically recorded and the pattern can subsequently be compared with
another biological sample to determine whether the individual can be excluded as
the source of the biological sample. 
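
Once the filter patterns have been scored, the exclusion comparison reduces to checking whether two SNP profiles agree at every SNP typed in both. The following Python sketch assumes each profile is recorded as a mapping from a SNP identifier to the set of alleles detected on the filter; that data layout, the `max_mismatches` tolerance, and the function name are hypothetical and are not part of the protocol above.

```python
def compare_profiles(reference_profile, sample_profile, max_mismatches=0):
    """Compare two SNP profiles and decide whether the sample can be excluded.

    Each profile maps a SNP identifier to a frozenset of detected alleles,
    e.g. frozenset("a") for a homozygote or frozenset("at") for a heterozygote.
    Returns (excluded, mismatching_snp_ids). Only SNPs typed in both profiles
    are compared.
    """
    shared = reference_profile.keys() & sample_profile.keys()
    if not shared:
        raise ValueError("the two profiles have no SNPs in common")
    mismatches = [
        snp for snp in sorted(shared)
        if reference_profile[snp] != sample_profile[snp]
    ]
    return len(mismatches) > max_mismatches, mismatches
```

A profile that matches at every shared SNP does not by itself prove identity; as noted in the specification, a sufficiently large number of SNPs must be assayed for the pattern to be unique to an individual.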

Example 4

Analysis of clipped reads

All RRS reads were clipped of sequencing vector and low quality ends,
which set a usable read length for each read. The clipped reads were
screened for repetitive sequence with RepeatMasker, using the default
human settings. Only reads with >=80 non-repetitive bases and >=100
Phred quality (Q) >=30 bases were used in this analysis. These RRS reads
were assembled using phrap.manyreads. Contigs with 2 or more reads must
be aligned from a common starting point, the enzyme identified in the
Production Protocol. High quality base discrepancies, Q>=23, were
identified as candidate SNPs. Further restrictions on each candidate SNP
were that its neighbouring 5 bases all had Q >=15, and that at least 9
of these 10 neighbouring bases agreed with the consensus. If the number
of detected SNPs in one clique was greater than 4 or the depth of the
assembly (not including the genomic sequence) was greater than 5, then
all SNPs were discarded for that contig.
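
The candidate-SNP criteria of this example can be expressed as a short Python sketch. It assumes the read and the consensus are already aligned without gaps so that the same index refers to the same position in both, and it reads "neighbouring 5 bases" as 5 bases on each side (10 in total); the function names and that interpretation are assumptions, and the sketch is not the software actually used.

```python
def passes_neighbourhood_filter(read, consensus, quals, pos,
                                min_snp_q=23, min_flank_q=15,
                                flank=5, min_agree=9):
    """Apply the per-position candidate-SNP criteria to one aligned read.

    read, consensus: aligned sequences of equal length (no gaps assumed).
    quals: Phred quality of each base in `read`.
    pos: 0-based index of the discrepant base.
    """
    if pos - flank < 0 or pos + flank >= len(read):
        return False  # not enough flanking sequence to evaluate
    if read[pos] == consensus[pos] or quals[pos] < min_snp_q:
        return False  # not a high quality discrepancy
    neighbours = [i for i in range(pos - flank, pos + flank + 1) if i != pos]
    if any(quals[i] < min_flank_q for i in neighbours):
        return False  # a flanking base is below the quality threshold
    agreeing = sum(read[i] == consensus[i] for i in neighbours)
    return agreeing >= min_agree


def keep_contig_snps(candidate_positions, depth, max_snps=4, max_depth=5):
    """Discard every candidate in a contig with too many SNPs or too many reads."""
    if len(candidate_positions) > max_snps or depth > max_depth:
        return []
    return list(candidate_positions)
```

The contig-level rule is applied after the per-position rule: a contig that accumulates more than 4 candidates, or more than 5 reads, is treated as a likely repeat and contributes no SNPs at all.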

Example 5 

PCR confirmation of polymorphism

PCR primers were designed to flank each candidate SNP, and the resulting
fragment amplified from each of the DNAs used to construct the library.
SNPs were considered validated if at least two distinct genotypes were
observed at the candidate position (or three, if a homozygous variant was
observed); in addition, no position could be heterozygous in all
individuals, as this would indicate a repeat sequence.

Example 6 

BLAST analysis/comparison of base call and quality

Each sequence was blasted to a library of known repeat sequences, and any
read containing >50% of bases in repeats was removed. The remaining reads were
blasted against one another, and candidate pairs identified if they shared
>80% sequence identity over at least 270 bases. These candidate
pairs were aligned using a modified Smith-Waterman alignment, and
candidate SNPs identified (see below). Two filters were used to ensure high accuracy
of declaring a sequence match, and to avoid inclusion of low-level repeat sequences.
First, a pair was declared only if the sequences aligned over their entire length (save
50 bp allowed on either end for sequencing end-effects), and no more than 1% of
the bases in the alignment were candidate SNPs (see below). Second, pairs were
then arranged into higher-order connected component groups (using
transitivity). Component groups with more than 8 reads were removed. Paired
sequences (see above) were run through the algorithm "SNPfinder", which
compares the base-call and quality of each position. A candidate SNP was declared if
two basecalls were present, the Phred score of each was >20, and the 10 bases flanking
the SNP (5 on either side) were of Phred quality >15.
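
Grouping read pairs by transitivity, as in the second filter of Example 6, amounts to finding connected components in a graph whose edges are the declared pairs. The Python sketch below uses a union-find structure for this; it is one common way to compute such groups and is an assumption, since the specification does not say how the grouping was implemented. The function names are hypothetical.

```python
from collections import defaultdict


def component_groups(pairs):
    """Group reads into connected components given declared (read, read) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for read in list(parent):
        groups[find(read)].add(read)
    return list(groups.values())


def filter_groups(pairs, max_reads=8):
    """Keep only component groups with at most `max_reads` reads."""
    return [group for group in component_groups(pairs) if len(group) <= max_reads]
```

If read A pairs with B and B pairs with C, all three fall into one group even when A and C were never declared a pair directly; groups that grow beyond 8 reads are dropped as probable low-level repeats.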

Example 7 

Cloning and sequencing to confirm polymorphism

A pool of 10 DNAs (the Pilot Panel) or 24 DNAs (the TSC Panel) was digested with a
restriction enzyme, size fractionated on an agarose gel, and cloned into M13-based
vectors. Sequences were obtained on ABI 377 or 3700 sequencers.
Base-calling was performed with Phrap.